Memo

From: PHL Geo Inc. TO: Philadelphia City Council Subject: Introduction of a new development monitoring tool

Dear City Council Members,

We are PHL Geo Inc., a Philadelphia based civil-service oriented coding organization using the power of geospatial analysis and city-owned data to provide informed insights to decision makers such as yourself. We are reaching out to introduce an exciting new tool which can be used in your districts immediately. The tool is called the GENTRISK and uses 10 years of development information to predict where development may happen in your community in the near future. The GENTRISK may be used in your district to manage the negative impacts of development, including trash dumping, street closures, and noise complaints. Getting ahead of development impacts can improve coordination and communication with residents in your district, improve the likelihood of successful permit applications, and enable smart zoning and growth in your city council region.

Development is a process determined by numerous factors and can have positive effects; however, development-induced gentrification, which is “a process a neighborhood change that includes economic change in a historically disinvested neighborhood - by means of real estate investment and new higher-income residents moving in - as well as demographic change - not only in terms of income level, but also in terms of changes in the education level or racial make-up of residents,” can have negative effects (Demsas, 2021). As city council members looking to promote econoimc development while ensuring that residents have the protections that they need, the GENTRISK can provide tools that will help you see where development pressures are happening and engage in proactive outreach to those communities. Aligning city services such as increaseing access to Philadelhpia’s Homestead Exemption or to the PA Homeowner Assistance fund, can reduce these negative impacts.

The GENTRISK uses information from the past 10 years of development and combines that with external factors including the distance to schools, tree density, and distance to public transit, in order to understand what is driving development in Philadelphia. The GENTRISK then uses the relationships between those factors to predict where development may occur so you can get ahead of any potential negative impacts. These predictions form the basis of several tools available in GENTRISK, such as scenario planning. In the scenario planning framework, you may modify inputs to see how they will impact development going forward. For example, if the new Philly Tree Plan is likely to bring more tree canopy to your community, you can use the GENTRISK to model how much further development is likely to occur.

The following document provides a technical overview of how the GENTRISK functions, how to interpret its results, and how it can be incorporated into your city planning techniques. We look forward to seeing GENTRISK integrated into your city planning toolkit.

For questions, comments, concerns, or to learn more, please email

Kindly, Akira DiSandro Benjamin Myers

PHL GEO Inc. Founders

Technical Details

The GENTRISK tool uses the construction of new housing and commercial buildings as its core output, as construction provides both opportunities (in the form of economic development and increased tax base) and pressures (in the form of gentrification and noise/trash) on residents and neighborhoods. Construction information is available from the Department of Licenses and Inspection for a time period stretching over the last 10 years. While neighborhoods are constantly in a state of change, focusing in on a more specific time period can provide City Council with an accurate understanding of what is driving development in the city. THe time period from 2013 to 2018 provides a good baseline of understanding development drivers as it is prior to the impacts that the COVID-19 global pandemic had on development drivers, yet recent enough that certain key drivers - such as the turnover of permanent residents for renters - are still prevalent within Philadelphia.

Data Exploration

The model powering GENTRISK requires that all of the variables are gathered in the same format and then compared across time. To facilitatate this process, the PHL GEO Inc. team used the following code to gather all of the variables and put them into a similar format. These variables can then provide an understnaidng of how neighborhoods are changing over time, followed by an investigation of what could be driving those changes; ie, if development is happening in a neighborhood, what are the factors that drive that development?

Census

A key way to contextualize these changes is to look at the changes that are happening within city council ditsrcts throughout Philadelphia. AS is shown through the map below, there are 10 city council districts in Philadelphia where changes and development pressures will have to be accoutned for.

# philly permit data
dat_permit <- st_read("https://phl.carto.com/api/v2/sql?q=SELECT+*+FROM+permits&filename=permits&format=geojson&skipfields=cartodb_id")

# only keep new construction permits
newcon_permits <- dat_permit %>%
  filter(grepl("NEW CON|NEWCON",typeofwork)) %>%
  mutate(year = substr(permitissuedate, 1,4)) %>%
  filter(year %in% c(2013:2019,2022)) %>%
  st_transform(crs = 2272)

rm(dat_permit) # remove this big file because we won't use this anymore

city_council <- st_read("https://opendata.arcgis.com/api/v3/datasets/1ba5a5d68f4a4c75806e78b1d9245924_0/downloads/data?format=geojson&spatialRefId=4326&where=1%3D1")

city_council <- city_council %>% 
  mutate(DISTRICT = factor(DISTRICT, levels = 1:10))

# census data
acs_variable_list.2019 <- load_variables(2019, #year
                                         "acs5", #five year ACS estimates
                                         cache = TRUE) %>% 
  filter(geography == "block group")
    
# variables of interest:
# B19013_001: medHHincome
# B01001_001: total pop 
# B02001_002: white pop
# B25058_001: median rent
# B15003_022: attainment of bachelor's of population 25+
# B25071_001: Median gross rent as a percentage of household income (past year)
# B07201_001: Geographical Mobility in the Past Year for Current Residence--Metropolitan Statistical Area Level in the United States, estimated total
# B07201_002: Geographical Mobility in the Past Year for Current Residence--Metropolitan Statistical Area Level in the United States, same house 1 year ago 
# B25008_002: total population in occupied housing by tenure, owner occupied
# B25008_003: total population in occupied housing by tenure, renter occupied
# B99082_001: allocation of private vehicle occupancy
# B25044_001: tenure by vehicles available
# B09002_001: own children under 18
# B11004_001: related children under 18

census_vars <- paste0(c("B01001_001", "B19013_001", "B02001_002", "B25058_001", "B15003_022", "B25071_001", "B07201_001",
                        "B07201_002", "B25008_002", "B25008_003", "B99082_001", "B25044_001", "B09002_001", 
                        "B11004_001"),"E") # census variables of interest that are available

medHHinc2019 <- 70459 # source: https://www.deptofnumbers.com/income/pennsylvania/philadelphia/

tracts19 <- 
  get_acs(geography = "block group", 
          variables = census_vars, 
          year = 2019, state = 42,
          geometry = T, output = "wide") %>%
  st_transform(crs = 2272) %>%
  dplyr::select(!matches("M$")) %>% 
  rename(total_pop = B01001_001E,
         med_hh_inc = B19013_001E,
         white_pop = B02001_002E,
         med_rent = B25058_001E,
         bachelors25 = B15003_022E,
         pct_rent_hhinc = B25071_001E,
         mobility_tot_metro = B07201_001E,
         samehouse1yr_metro = B07201_002E,
         owner_occ = B25008_002E,
         renter_occ = B25008_003E,
         private_vehicle_occ = B99082_001E,
         vehicles_avail = B25044_001E,
  ) %>%
  mutate(children = B09002_001E + B11004_001E,
         pct_white = ifelse(total_pop > 0, white_pop / total_pop,0),
         RaceContext = ifelse(pct_white > 0.5, "Majority White", 
                              ifelse(total_pop != 0 , "Majority non-White", NA)),
         year = "2019") %>% 
  dplyr::select(!matches("^B|white_pop")) %>% 
  st_centroid()

# philly neighborhood data
nhoods_path <- 'https://raw.githubusercontent.com/azavea/geo-data/master/Neighborhoods_Philadelphia/Neighborhoods_Philadelphia.geojson'
nhoods <- st_read(nhoods_path, quiet = T) %>%
  st_transform(crs = 2272) %>%
  dplyr::select(mapname)

# philly bounds
philly <- st_read("https://opendata.arcgis.com/datasets/405ec3da942d4e20869d4e1449a2be48_0.geojson") %>%
  st_transform(crs = 2272) %>% 
  dplyr::select(OBJECTID,geometry)

# limit permit dat to philly border
newcon_permits <- newcon_permits %>% 
  .[philly,]

# 311 incident reports data
carto_url = "https://phl.carto.com/api/v2/sql"

# Crime incidents
table_name_311 = "public_cases_fc"

# query
where_311 = "closed_datetime >= '2010-01-01' AND closed_datetime < '2020-01-01' AND service_name IN ('Rubbish/Recyclable Material Collection','Illegal Dumping','Construction Complaints', 'Building Construction', 'Sanitation Violation', 'Street Trees', 'Dangerous Sidewalk', 'Homeless Encampment Request', 'Parks and Rec Safety and Maintenance', 'Vacant House or Commercial')"

query311 = paste("SELECT *",
                 "FROM", table_name_311,
                 "WHERE", where_311)

reports311 = rphl::get_carto(query311, format = "csv", base_url = carto_url, stringsAsFactors = F)
reports311 <- reports311 %>%
  mutate(Year = year(format.Date(reports311$closed_datetime)))
reports311 <- reports311 %>%
  filter(!is.na(lon) & !is.na(lat)) %>%
  st_as_sf(coords = c("lon","lat"), crs = 4326) %>%
  st_transform(crs = 2272) %>% 
  mutate(
    type = str_replace_all(service_name, "[^[:alnum:]]", ""),
    Legend = "311") %>% 
  dplyr::select(Legend, type, geometry, Year)
# plot city council districts
ggplot() +
  geom_sf(data=city_council, aes(fill=DISTRICT)) +
  scale_fill_viridis(discrete = T) +
  labs(fill = "City Council District",
       title = "City Council Districts Within Philadelphia",
       caption = "Figure 1.") +
  mapTheme()

Development patterns may be spread over space, such that it forms a smooth ‘mosaic’ across the city of Philadelphia. Alternatively, it is possible that several permits may be clustered together, either because they were all submitted as part of the same construction project (ie, submitting an electrical permit and roofing permit for the same property) or there were several units being worked in tandem. To better understand whether Philadelphia is experiencing a ‘mosaic’ of development or if development has been clustered into particular areas, a fishnet was established for grid cells of 500 meter by 500 meters across the city. The 500 meter scale (1/3rd mile) allows for a fluid understanding of development patterns at the block scale.

Figure 1 below shows the number of permits granted for each fishnet square in 2014, followed by Figure 2 showing the number of permits for each fishnet square in 2015, and so on until Figure 6 demonstrating the number of permits granted for each fishnet square in 2019. As is visible through the maps, trends and patterns of development shift across the city, such that in 2014 development centered around the Center City Neighborhood, while moving towards 2019 - when record number of permits were granted for the time period in focus - development was centered in Center City as well as North Philadelphia.

It is worthwhile to note that while there are a number of areas with extensive permit counts and development pressures, so too are there a significant number of fishnet cells in the city that did not experience much - if any - development. These low-no development areas, from the perspective of new construction, could have an outside influence on the model with impacts on its accuracy in areas experiecing higher levels of development. This was factored into the model through the use of a Poisson distribution, described in greater detail below.

Figures 2. Permit Count on Fishnet by Year

# our CRS is in feet, but want to make fishnet a 500m x 500m
cell_size <- 500 * 3.28084

fishnet <- 
  st_make_grid(philly,
               cellsize = cell_size,  
               square = TRUE,
               crs = 2272) %>% 
  .[philly] %>%
  st_sf() %>%              
  mutate(uniqueID = 1:n()) 

# 2013 
{
  permit_net13 <- 
    dplyr::select(newcon_permits %>% filter(year == "2013")) %>% 
    mutate(count_permits = 1) %>% 
    aggregate(., fishnet, sum) %>%
    mutate(count_permits13 = replace_na(count_permits, 0),
           uniqueID = as.numeric(rownames(.)),
           cvID = sample(round(nrow(fishnet) / 16), size=nrow(fishnet), replace = TRUE))
}

# 2014
{
  permit_net14 <- 
    dplyr::select(newcon_permits %>% filter(year == "2014")) %>% 
    mutate(count_permits = 1) %>% 
    aggregate(., fishnet, sum) %>%
    mutate(count_permits14 = replace_na(count_permits, 0),
           uniqueID = as.numeric(rownames(.)),
           cvID = sample(round(nrow(fishnet) / 16), size=nrow(fishnet), replace = TRUE)) %>% 
    dplyr::select(-count_permits)
}

# 2015
{
  permit_net15 <- 
    dplyr::select(newcon_permits %>% filter(year == "2015")) %>% 
    mutate(count_permits = 1) %>% 
    aggregate(., fishnet, sum) %>%
    mutate(count_permits15 = replace_na(count_permits, 0),
           uniqueID = as.numeric(rownames(.)),
           cvID = sample(round(nrow(fishnet) / 16), size=nrow(fishnet), replace = TRUE)) %>% 
    dplyr::select(-count_permits)
}

# 2016
{
  permit_net16 <- 
    dplyr::select(newcon_permits %>% filter(year == "2016")) %>% 
    mutate(count_permits = 1) %>% 
    aggregate(., fishnet, sum) %>%
    mutate(count_permits16 = replace_na(count_permits, 0),
           uniqueID = as.numeric(rownames(.)),
           cvID = sample(round(nrow(fishnet) / 16), size=nrow(fishnet), replace = TRUE)) %>% 
    dplyr::select(-count_permits)
}

# 2017
{
  permit_net17 <- 
    dplyr::select(newcon_permits %>% filter(year == "2017")) %>% 
    mutate(count_permits = 1) %>% 
    aggregate(., fishnet, sum) %>%
    mutate(count_permits17 = replace_na(count_permits, 0),
           uniqueID = as.numeric(rownames(.)),
           cvID = sample(round(nrow(fishnet) / 16), size=nrow(fishnet), replace = TRUE)) %>% 
    dplyr::select(-count_permits)
}

# 2018
{
  permit_net18 <- 
    dplyr::select(newcon_permits %>% filter(year == "2018")) %>% 
    mutate(count_permits = 1) %>% 
    aggregate(., fishnet, sum) %>%
    mutate(count_permits18 = replace_na(count_permits, 0),
           uniqueID = as.numeric(rownames(.)),
           cvID = sample(round(nrow(fishnet) / 16), size=nrow(fishnet), replace = TRUE)) %>% 
    dplyr::select(-count_permits)
}

# 2019
{
  permit_net19 <- 
    dplyr::select(newcon_permits %>% filter(year == "2019")) %>% 
    mutate(count_permits = 1) %>% 
    aggregate(., fishnet, sum) %>%
    mutate(count_permits19 = replace_na(count_permits, 0),
           uniqueID = as.numeric(rownames(.)),
           cvID = sample(round(nrow(fishnet) / 16), size=nrow(fishnet), replace = TRUE)) %>% 
    dplyr::select(-count_permits)
}

# combine permit counts from all years
permit_net_allyrs <- permit_net13 %>% 
  st_drop_geometry() %>%
  dplyr::select(uniqueID,cvID,count_permits13) %>% 
  left_join(permit_net14 %>% st_drop_geometry() %>% dplyr::select(uniqueID,count_permits14), by = "uniqueID") %>% 
  left_join(permit_net15 %>% st_drop_geometry() %>% dplyr::select(uniqueID,count_permits15), by = "uniqueID") %>% 
  left_join(permit_net16 %>% st_drop_geometry() %>% dplyr::select(uniqueID,count_permits16), by = "uniqueID") %>% 
  left_join(permit_net17 %>% st_drop_geometry() %>% dplyr::select(uniqueID,count_permits17), by = "uniqueID") %>% 
  left_join(permit_net18 %>% st_drop_geometry() %>% dplyr::select(uniqueID,count_permits18), by = "uniqueID") %>% 
  left_join(permit_net19 %>% st_drop_geometry() %>% dplyr::select(uniqueID,count_permits19), by = "uniqueID") %>% 
  left_join(permit_net13 %>% dplyr::select(-c(cvID,count_permits13,count_permits)), by = "uniqueID")

# map of permit count over time
permit_net_formap <- permit_net_allyrs %>% 
  dplyr::select(-c(uniqueID,cvID,count_permits14,count_permits16,count_permits18)) %>% 
  rename(`Permit Count 2013` = count_permits13,
         `Permit Count 2015` = count_permits15,
         `Permit Count 2017` = count_permits17,
         `Permit Count 2019` = count_permits19) %>% 
  gather("count_permits", "value", -geometry) %>% 
  st_as_sf()

permit_count_map <- ggplot(data = permit_net_formap) +
  geom_sf(aes(fill = value)) +
  scale_fill_viridis() +
  facet_wrap(~count_permits) +
  labs(title = "Count of New Construction Permits for the fishnet",
       caption = "Figure 2.") +
  mapTheme(title_size = 14) + theme(legend.position="bottom")

permit_count_map

Census Information

There are a number of different methods to account for neighborhood change and the impacts of development, all of which are powered. The GENTRISK takes into account several census features, which are described below (along with their reasoning)

  • Total Population: as population increases in a census tract, so too will housing pressures which can in turn lead to new construction of housing.
  • Median HH Income: as median household income increases, it is likely that the cost of amenities in the area, including housing, groceries, etc. will increase as well.
  • Median Rent: a clear marker of development pressure, rent prices will increase as a neighborhood becomes more expensive. It can also be used to ask ‘does higher rent in an area indcue more development’ as developers seek to capitalize on higher prices?
  • Staying in the same house for 1 year: a marker of how many long-term residents are around, this can be used to assess the relationships that development has with the transition out of longer term residents.
  • Renter Occupied (total)
  • Number of Vehicles available
  • Number of Children

The model powering GENTRISK requires that all of the variables are gathered in the same format and then compared across time. To facilitatate this process, the PHL GEO Inc. team used the following code to gather all of the variables and put them into a similar format. These variables can then provide an understnaidng of how neighborhoods are changing over time, followed by an investigation of what could be driving those changes; ie, if development is happening in a neighborhood, what are the factors that drive that development?

  • Building Construction complaints
  • Vacant Housing or Commercial complaints
  • Dangerous Sidewalk complaints
  • Illegal dumping complaints
  • Parks and Recreation Safety and Maintenance complaints
  • Rubbish / Recyclable Material Collection complaints
  • Street Tree maintenance complaints
  • And the distance between the centroid of each grid cell and the nearest 3 instances of the complaints listed above.
# function to make fishnet from acs dataframe (dat_tract), for variable (var_name), giving it the legend (var_name), aggregating with function (sum_or_mean)
# for example, to make a fishnet summing all total_pop variables for 2019, one would define the variables as follows:
# dat_tract <- tracts19; var_name <- "total_pop"; sum_or_mean <- sum
makenet <- function(dat_tract, var_name, sum_or_mean){
  
  net_tract <- dat_tract %>% 
    .[philly,] %>% 
    dplyr::select(matches(var_name)) %>% 
    mutate(Legend = var_name) %>% 
    st_join(fishnet, join=st_within) %>%
    st_drop_geometry() %>%
    group_by(uniqueID, Legend) %>%
    summarize(count = sum_or_mean(get(var_name), na.rm = T)) %>%
    left_join(fishnet, ., by = "uniqueID") %>%  # add geometry back in
    spread(Legend, count, fill=0) %>%  # fill in ones where fishnet was missing, count was NA with 0
    dplyr::select(-`<NA>`) %>%
    ungroup()
  
  return(net_tract)
}
  
var_names <- names(tracts19)[3:14]

# for loop to create fish nets for all ACS variables for all years
for (i in 1:length(var_names)){
    # select year, variable, and legend name
    var_name <- var_names[i]
    
    if (i %in% c(2:4,18)){
      sum_or_mean <- mean
    } else {
      sum_or_mean <- sum
    }
    
    # run function to make fishnet
    net_tract <- makenet(tracts19,var_name,sum_or_mean)
    
    # save as individual fishnet, uncomment if neededd
    # net_tract_name <- paste0("net_",var_name,yr)
    # assign(net_tract_name, net_tract)
    
    # save all variables into one big net
    if (i == 1) {
      census_net <- net_tract %>% 
        st_drop_geometry()
    } else {
      net_tract_tojoin <- net_tract %>% 
        st_drop_geometry() %>% 
        dplyr::select(1:2)
      census_net <- left_join(census_net, net_tract_tojoin, by = "uniqueID")
    }
  }
long_15 <- reports311 %>% 
  filter(Year == 2015) %>% 
  arrange(type) %>%  
  mutate(Legend = paste0(type, Year), .keep = "unused")

long_16 <- reports311 %>% 
  filter(Year == 2016) %>% 
  arrange(type) %>%  
  mutate(Legend = paste0(type, Year), .keep = "unused")

long_17 <- reports311 %>% 
  filter(Year == 2017) %>% 
  arrange(type) %>%  
  mutate(Legend = paste0(type, Year), .keep = "unused")

long_18 <- reports311 %>% 
  filter(Year == 2018) %>% 
  arrange(type) %>%  
  mutate(Legend = paste0(type, Year), .keep = "unused")

long_19 <- reports311 %>% 
  filter(Year == 2019) %>% 
  arrange(type) %>%  
  mutate(Legend = paste0(type, Year), .keep = "unused")

vars311_net <- rbind(long_15, long_16, long_17, long_18, long_19) %>% 
  st_join(fishnet, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID, Legend) %>%
  summarize(count = n()) %>% 
  left_join(fishnet, ., by = "uniqueID") %>%  # add geometry back in
  spread(Legend, count, fill=0) %>%  # fill in ones where fishnet was missing, count was NA with 0
  dplyr::select(-`<NA>`) %>%
  ungroup()

# use something like code below (from lab 6) to create nearest neighbor counts
# convenience to reduce length of function names.
st_c    <- st_coordinates
st_coid <- st_centroid

vars311_net_ccoid <- st_c(st_coid(vars311_net))

# add nearest neighbor features
vars311_net <- vars311_net %>%
  mutate(buildingCon15.dist = nn_function(vars311_net_ccoid, 
                                          st_c(long_15 %>% filter(Legend == "BuildingConstruction2015")), 3),
         dangerSide15.dist = nn_function(vars311_net_ccoid, 
                                         st_c(long_15 %>% filter(Legend == "DangerousSidewalk2015")), 3),
         illegalDump15.dist = nn_function(vars311_net_ccoid, st_c(long_15 %>% filter(Legend == "IllegalDumping2015")), 3),
         parksNrec15.dist = nn_function(vars311_net_ccoid, 
                                        st_c(long_15 %>% filter(Legend == "ParksandRecSafetyandMaintenance2015")), 3),
         materialColl15.dist = nn_function(
           vars311_net_ccoid, st_c(long_15 %>% filter(Legend == "RubbishRecyclableMaterialCollection2015")), 3),
         streetTrees15.dist = nn_function(vars311_net_ccoid, st_c(long_15 %>% filter(Legend == "StreetTrees2015")), 3),
         vacant15.dist = nn_function(vars311_net_ccoid, 
                                     st_c(long_15 %>% filter(Legend == "VacantHouseorCommercial2015")), 3),
         
         buildingCon16.dist = nn_function(vars311_net_ccoid, 
                                          st_c(long_16 %>% filter(Legend == "BuildingConstruction2016")), 3),
         dangerSide16.dist = nn_function(vars311_net_ccoid, 
                                         st_c(long_16 %>% filter(Legend == "DangerousSidewalk2016")), 3),
         illegalDump16.dist = nn_function(vars311_net_ccoid, st_c(long_16 %>% filter(Legend == "IllegalDumping2016")), 3),
         parksNrec16.dist = nn_function(vars311_net_ccoid, 
                                        st_c(long_16 %>% filter(Legend == "ParksandRecSafetyandMaintenance2016")), 3),
         materialColl16.dist = nn_function(
           vars311_net_ccoid, st_c(long_16 %>% filter(Legend == "RubbishRecyclableMaterialCollection2016")), 3),
         streetTrees16.dist = nn_function(vars311_net_ccoid, st_c(long_16 %>% filter(Legend == "StreetTrees2016")), 3),
         vacant16.dist = nn_function(vars311_net_ccoid, 
                                     st_c(long_16 %>% filter(Legend == "VacantHouseorCommercial2016")), 3),
         
         buildingCon17.dist = nn_function(vars311_net_ccoid, 
                                          st_c(long_17 %>% filter(Legend == "BuildingConstruction2017")), 3),
         dangerSide17.dist = nn_function(vars311_net_ccoid, 
                                         st_c(long_17 %>% filter(Legend == "DangerousSidewalk2017")), 3),
         illegalDump17.dist = nn_function(vars311_net_ccoid, st_c(long_17 %>% filter(Legend == "IllegalDumping2017")), 3),
         parksNrec17.dist = nn_function(vars311_net_ccoid, 
                                        st_c(long_17 %>% filter(Legend == "ParksandRecSafetyandMaintenance2017")), 3),
         materialColl17.dist = nn_function(
           vars311_net_ccoid, st_c(long_17 %>% filter(Legend == "RubbishRecyclableMaterialCollection2017")), 3),
         streetTrees17.dist = nn_function(vars311_net_ccoid, st_c(long_17 %>% filter(Legend == "StreetTrees2017")), 3),
         vacant17.dist = nn_function(vars311_net_ccoid, 
                                     st_c(long_17 %>% filter(Legend == "VacantHouseorCommercial2017")), 3),
         
         buildingCon18.dist = nn_function(vars311_net_ccoid, 
                                          st_c(long_18 %>% filter(Legend == "BuildingConstruction2018")), 3),
         dangerSide18.dist = nn_function(vars311_net_ccoid, 
                                         st_c(long_18 %>% filter(Legend == "DangerousSidewalk2018")), 3),
         illegalDump18.dist = nn_function(vars311_net_ccoid, st_c(long_18 %>% filter(Legend == "IllegalDumping2018")), 3),
         parksNrec18.dist = nn_function(vars311_net_ccoid, 
                                        st_c(long_18 %>% filter(Legend == "ParksandRecSafetyandMaintenance2018")), 3),
         materialColl18.dist = nn_function(
           vars311_net_ccoid, st_c(long_18 %>% filter(Legend == "RubbishRecyclableMaterialCollection2018")), 3),
         streetTrees18.dist = nn_function(vars311_net_ccoid, st_c(long_18 %>% filter(Legend == "StreetTrees2018")), 3),
         vacant18.dist = nn_function(vars311_net_ccoid, 
                                     st_c(long_18 %>% filter(Legend == "VacantHouseorCommercial2018")), 3),
         
         buildingCon19.dist = nn_function(vars311_net_ccoid, 
                                          st_c(long_19 %>% filter(Legend == "BuildingConstruction2019")), 3),
         dangerSide19.dist = nn_function(vars311_net_ccoid, 
                                         st_c(long_19 %>% filter(Legend == "DangerousSidewalk2019")), 3),
         illegalDump19.dist = nn_function(vars311_net_ccoid, st_c(long_19 %>% filter(Legend == "IllegalDumping2019")), 3),
         parksNrec19.dist = nn_function(vars311_net_ccoid, 
                                        st_c(long_19 %>% filter(Legend == "ParksandRecSafetyandMaintenance2019")), 3),
         materialColl19.dist = nn_function(
           vars311_net_ccoid, st_c(long_19 %>% filter(Legend == "RubbishRecyclableMaterialCollection2019")), 3),
         streetTrees19.dist = nn_function(vars311_net_ccoid, st_c(long_19 %>% filter(Legend == "StreetTrees2019")), 3),
         vacant19.dist = nn_function(vars311_net_ccoid, 
                                     st_c(long_19 %>% filter(Legend == "VacantHouseorCommercial2019")), 3)) 

Feature Selection

All census variables were pulled at the block group level, which is a smaller scale than our fishnet. There is no perfect way to join block-group-level census data to our fishnet, so our approach was to take the centroid of the block groups and aggregated the variables (by either taking the sum or average) of the block groups whose centroids landed in a fishnet grid cell (idk how to phrase this well so feel free to rephrase!). As a result, some fishnet cells that cover residential areas of Philadelphia may have a value of 0 for variables like “Total Population” since no block group centroid was within the bounds of that cell.

# join census vars fishnet to permit count net + 311 net
all_net <- left_join(permit_net_allyrs, census_net, by = "uniqueID") %>% 
  left_join(vars311_net %>% st_drop_geometry(), by = "uniqueID") %>% 
  st_as_sf()

Figure 3. Correlation Matrix of all possible risk factors

# making this moreso to see which would actually be good predictors
var_for_model <- c("uniqueID","cvID","count_permits19","count_permits18","count_permits17","count_permits16",
                   "count_permits15","BuildingConstruction2019","IllegalDumping2019","DangerousSidewalk2019",
                   "RubbishRecyclableMaterialCollection2019","vehicles_avail","med_rent","renter_occ",
                   "VacantHouseorCommercial2019","total_pop","samehouse1yr_metro","StreetTrees2019",
                   "children","parksNrec19.dist")

all_net19 <- all_net %>% 
  dplyr::select(all_of(var_for_model))

# only keep years for count_permit columns
names(all_net19) <- c(str_replace_all(var_for_model, c("2019" = "","19." = ".")), "geometry")

for_cormat19 <- all_net19 %>% 
  st_drop_geometry() %>% 
  dplyr::select(-c(uniqueID,cvID))

ggcorrplot(
  round(cor(for_cormat19), 1), 
  p.mat = cor_pmat(for_cormat19),
  colors = c("#4b2875", "white", "#9c1339"),
  type="lower",
  insig = "blank",
  digits = 4,
  lab = T, lab_size = 2) +  
  labs(title = "Correlation Matrix for 2019 fishnet",
       caption = "Figure 3.") 

Spatial Process of Permit Count

We also added some features that explain the spatial process of permit issuance in Philadelphia. Using local Moran’s I, we examined the spatial process which helps us understand whether the permit count at a certain location is randomly distributed or clustered relative to its immediate neighbors.

Models

Since the outcome we are interested in is a count or sum of vandalism incidents, I use a Poisson regression model to estimate the outcome. I created two types of models – one with just risk factors as the predictors and another with both risk factors and spatial process (I ultimately only kept the model with both risk factors and spatial process).

all_net19 %>% ggplot(aes(x = count_permits19)) +
  geom_histogram(bins = 66, fill = viridis::cividis(1)) +
  theme(text = element_text( color = "black"),
    plot.title = element_text(size = 14, colour = "black"), 
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size=12),
    axis.title = element_text(size=12),
    axis.text = element_text(size=10),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic"),
    legend.text = element_text(colour = "black", face = "italic"),
    strip.text.x = element_text(size = 14)) +
  labs(title = "Permit Distribution",
       x = "Count of New Construction Permits", y = "Count",
       caption = "Figure 4.")

Validation

To assess the validity of my models I look at the accuracy (how accurate are my predictions in the set that the models are trained on?) and more importantly, generalizability (can I predict counts of vandalism across different neighborhoods? For another year?).

  1. Accuracy

To assess accuracy of my models, I examined the mean absolute error (MAE; where error = predicted value - observed value) in the dataset containing vandalism incidents for 2021 – the same data we used to train the model.

  1. Generalizability

The most important validity measure of a geospatial risk prediction model is the generalizability as we want to be able to train a model and predict future outcomes, in our case counts of vandalism. I use random k-fold (with k ~ 100) cross validation as well as ‘Leave-one-group-out’ cross-validation (LOGO-CV) to assess model generalizability. LOGO-CV helps us assess model generalizability across neighborhoods. I also predict vandalism risk scores for the following year (2022) to assess whether the model generalizes across time using the 2021 kernel density.

Results

Visualizing Predictors and Local Moran’s I

[description of figure 4]

Figure 5. Predictors in final model

# add neighborhood names
allnet19_formodel <- all_net19 %>% 
  st_centroid() %>%
  st_join(dplyr::select(nhoods, mapname)) %>%
  st_drop_geometry() %>%
  left_join(dplyr::select(all_net19, geometry, uniqueID), by = "uniqueID") %>%
  st_sf() 

vars_net.long <- gather(allnet19_formodel %>% dplyr::select(-count_permits19),
                            variable, value, -geometry, -uniqueID, -cvID, -mapname)

vars <- unique(vars_net.long$variable)
varList <- list()

for (i in vars) {
  varList[[i]] <- ggplot() +
      geom_sf(data = filter(vars_net.long, variable == i),aes(fill = value), colour=NA) +
      scale_fill_viridis(name = "") +
      labs(title = i) +
      theme(text = element_text( color = "black"),
            plot.title = element_text(size = 14,colour = "black"),
            plot.subtitle=element_text(face="italic"),
            plot.caption=element_text(hjust=0),
            axis.ticks = element_blank(),
            panel.background = element_blank(),axis.title = element_blank(),
            axis.text = element_blank(),
            axis.title.x = element_blank(),
            axis.title.y = element_blank(),
            panel.grid.minor = element_blank(),
            panel.border = element_rect(colour = "black", fill=NA, size=2),
            strip.text.x = element_text(size = 7),
            legend.position = "bottom",
            legend.text = element_text(size = 5))
  }

do.call(grid.arrange,c(varList, ncol = 3, top = "Predictors of Permit Count (on fishnet)", bottom = "Figure 5."))

[description of figure 6]

Figure 6. Adding indicators of significant spatial processes

final_net.nb <- poly2nb(as_Spatial(allnet19_formodel), queen=TRUE)
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE) # turn neighborhood weights into list of weights

local_morans <- localmoran(allnet19_formodel$count_permits19, final_net.weights, zero.policy=TRUE) %>%
  as.data.frame() # Ii moran's I at ith cell, Ei expected/mean from neighbors

# join local Moran's I results to fishnet
final_net.localMorans <- 
  cbind(local_morans, as.data.frame(allnet19_formodel)) %>% 
  st_sf() %>%
  dplyr::select(`Permit Count` = count_permits19, 
                `Local Morans I` = Ii, 
                `P Value` = `Pr(z != E(Ii))`) %>%
  mutate(`Significant Hotspots` = ifelse(`P Value` <= 0.001, 1, 0)) %>%
  gather(variable, value, -geometry)

# now plot
vars <- unique(final_net.localMorans$variable)
varList <- list()

for(i in vars){
  varList[[i]] <- 
    ggplot() +
    geom_sf(data = filter(final_net.localMorans, variable == i),aes(fill = value), colour=NA) +
    scale_fill_viridis(name="") +
    labs(title=i) +
    mapTheme(title_size = 14) + theme(legend.position="bottom")
}

do.call(grid.arrange,c(varList, ncol = 2, top = "Local Moran's I Statistics for Permit Count in Philadelphia", 
                       bottom = "Figure 6."))

final_net <-
  allnet19_formodel %>% 
  mutate(permitct.isSig = 
           ifelse(localmoran(allnet19_formodel$count_permits19, 
                             final_net.weights)[,5] <= 0.001, 1, 0)) %>%
  mutate(permitct.isSig.dist = 
           nn_function(st_coordinates(st_centroid(allnet19_formodel)),
                       st_coordinates(st_centroid(
                         filter(allnet19_formodel, permitct.isSig == 1))), 1))

[description of figure 6]

Figure 7. Scatterplots of Predictors

correlation.long <-
  st_drop_geometry(final_net) %>%
  dplyr::select(-uniqueID, -cvID, -mapname) %>%
  gather(variable, value, -count_permits19)

correlation.cor <-
  correlation.long %>%
  group_by(variable) %>%
  summarize(correlation = cor(value, count_permits19, use = "complete.obs"))
 
ordered_vars <- c("count_permits18","count_permits17","count_permits16","count_permits15","total_pop","children",
                  "med_rent","renter_occ","samehouse1yr_metro","vehicles_avail","BuildingConstruction","IllegalDumping",
                  "DangerousSidewalk","StreetTrees","RubbishRecyclableMaterialCollection","VacantHouseorCommercial",
                  "parksNrec.dist","permitct.isSig","permitct.isSig.dist")

ggplot(correlation.long %>% 
         mutate(variable = factor(variable, levels = ordered_vars)), 
       aes(value, count_permits19)) +
  geom_point(size = 0.1) +
  geom_text(data = correlation.cor %>% 
              mutate(variable = factor(variable, levels = ordered_vars)), 
            aes(label = paste("r =", round(correlation, 2))),
            x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
  geom_smooth(method = "lm", se = FALSE, colour = "black") +
  facet_wrap(~variable, ncol = 3, scales = "free") +
  labs(title = "Permit count as a function of predictors",
       caption = "Figure 7.") +
  theme(text = element_text( color = "black"),
        plot.title = element_text(size = 14, colour = "black"), 
        plot.subtitle = element_text(face="italic"),
        plot.caption = element_text(hjust=0),
        axis.ticks = element_blank(),
        panel.background = element_blank(),
        panel.grid.major = element_line("grey80", size = 0.1),
        panel.grid.minor = element_blank(),
        panel.border = element_rect(colour = "black", fill=NA, size=2),
        strip.background = element_rect(fill = "grey80", color = "white"),
        strip.text = element_text(size=12),
        axis.title = element_text(size=12),
        axis.text = element_text(size=10),
        plot.background = element_blank(),
        legend.background = element_blank(),
        legend.title = element_text(colour = "black", face = "italic"),
        legend.text = element_text(colour = "black", face = "italic"),
        strip.text.x = element_text(size = 10))

# just risk factors
reg.vars <- c("count_permits18","count_permits17","count_permits16","count_permits15","BuildingConstruction",
              "IllegalDumping","DangerousSidewalk","RubbishRecyclableMaterialCollection","vehicles_avail",
              "med_rent","renter_occ","VacantHouseorCommercial","total_pop","samehouse1yr_metro","StreetTrees",
              "children","parksNrec.dist") 

# random k-fold CV
reg.CV <- crossValidate(dataset = final_net,
                        id = "cvID",
                        dependentVariable = "count_permits19",
                        indVariables = reg.vars) %>%
  dplyr::select(cvID, count_permits19, Prediction, geometry) %>% 
  mutate(error = count_permits19 - Prediction)

# MAE
reg.MAE <- mean(abs(reg.CV$error)) # ~7.59063

# with local Moran's I spatial process features
reg.sp.vars <- c(reg.vars,"permitct.isSig","permitct.isSig.dist")

## RUN REGRESSIONS
reg.spatialCV <- crossValidate(
  dataset = final_net,
  id = "cvID",                           
  dependentVariable = "count_permits19",
  indVariables = reg.sp.vars) %>% 
  dplyr::select(cvID, count_permits19, Prediction, geometry) %>% 
  mutate(error = count_permits19 - Prediction)

# MAE
reg.spatial.MAE <- mean(abs(reg.spatialCV$error)) # ~5.589619

# LOGO-CV
# needed to redefine crossValidate to solve the issue with a neighborhood of NA
# this function also works for the previous lines using crossValidate()
crossValidate <- function(dataset, id, dependentVariable, indVariables) {
  
  allPredictions <- data.frame()
  # cvID_list <- unique(dataset[[id]])
  cvID_list <- as.character(na.omit(unique(dataset[[id]])))
  
  for (i in cvID_list) {
    
    thisFold <- i
    cat("This hold out fold is", thisFold, "\n")
    
    fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, geometry, all_of(indVariables),
                    all_of(dependentVariable))
    fold.test  <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, geometry, all_of(indVariables),
                    all_of(dependentVariable))
    
    form_parts <- paste0(dependentVariable, " ~ ", paste0(indVariables, collapse = "+"))
    form <- as.formula(form_parts)
    regression <- glm(form, family = "poisson",
                      data = fold.train %>%
                        dplyr::select(-geometry, -id))
    
    thisPrediction <-
      mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
    
    allPredictions <-
      rbind(allPredictions, thisPrediction)
    
  }
  return(st_sf(allPredictions))
}

# adding neighborhood for LOGO CV, risk factors only
reg.logoCV <- crossValidate(
  dataset = final_net,
  id = "mapname",
  dependentVariable = "count_permits19",
  indVariables = reg.vars) %>%
  dplyr::select(cvID = mapname, count_permits19, Prediction, geometry) %>% 
  mutate(error = count_permits19 - Prediction)

# MAE
reg.logo.MAE <- mean(abs(reg.logoCV$error)) # 8.290278

# risk factors + spatial process
reg.logo.spatialCV <- crossValidate(
  dataset = final_net,
  id = "mapname",                           
  dependentVariable = "count_permits19",
  indVariables = reg.sp.vars) %>% 
  dplyr::select(cvID = mapname, count_permits19, Prediction, geometry) %>% 
  mutate(error = count_permits19 - Prediction)

# MAE
reg.logo.spatial.MAE <- mean(abs(reg.logo.spatialCV$error)) # 6.158161

Model Selection

Cross Validation

[description of figure 8 & 9]

Figure 8. Map of Model Errors

reg.summary <- rbind(
  mutate(reg.CV,
         Error = Prediction - count_permits19,
         Regression = "Random k-fold CV: Predictors"),
  mutate(reg.spatialCV,
         Error = Prediction - count_permits19,
         Regression = "Random k-fold CV: Predictors with Spatial Process"),
  mutate(reg.logoCV, 
         Error = Prediction - count_permits19,
         Regression = "Spatial LOGO-CV: Predictors"),
  mutate(reg.logo.spatialCV, 
         Error = Prediction - count_permits19,
         Regression = "Spatial LOGO-CV: Predictors with Spatial Process")) %>%
  st_sf() 

error_by_reg_and_fold <- 
  reg.summary %>%
  group_by(Regression, cvID) %>% 
  summarize(Mean_Error = mean(Prediction - count_permits19, na.rm = T),
            MAE        = mean(abs(Mean_Error), na.rm = T),
            SD_MAE     = mean(abs(Mean_Error), na.rm = T)) %>%
  ungroup()

# make map
# adjust mapTheme in order to fit title
mapTheme <- function(base_size = 12, title_size = 16) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = title_size,colour = "black"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2),
    strip.text.x = element_text(size = 10))
}

error_by_reg_and_fold %>%
  ggplot() +
  geom_sf(aes(fill = MAE)) +
  facet_wrap(~Regression) +
  scale_fill_viridis() +
  labs(title = "Errors by Cross Validation method",
       caption = "Figure 8.") +
  mapTheme(title_size = 14) + theme(legend.position="bottom")

Figure 9. Predicted vs Real 2019 Permit Count

Here, we show the predicted vs real counts of new construction permits in 2019 (see figure 9).

Figure 10. Bar Plots of Error

# adjust plotTheme to fit titles
plotTheme <- function(base_size = 12, title_size = 16) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = title_size, colour = "black"), 
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size=12),
    axis.title = element_text(size=12),
    axis.text = element_text(size=10),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic"),
    legend.text = element_text(colour = "black", face = "italic"),
    strip.text.x = element_text(size = 10)
  )
}

error_by_reg_and_fold %>%
  ggplot(aes(MAE)) + 
  geom_histogram(bins = 30, colour="black", fill = "#FDE725FF") +
  facet_wrap(~Regression) +  
  geom_vline(xintercept = 0) + scale_x_continuous(breaks = seq(0, 450, by = 50)) + 
  labs(title="Distribution of MAE", subtitle = "k-fold cross validation vs. LOGO-CV",
       x="Mean Absolute Error",     y="Count",
       caption = "Figure 10.") +
  plotTheme(title_size = 14) + theme(legend.position="bottom")

[description of table 1] Table 1 below shows the summary of regressions, including the mean and standard deviation of MAE for both of the models and both of the cross validation methods. We highlighted the row showing the results of both cross validation methods for the model including spatial process, to make clear how much these spatial process features improved our model.

Table 1. Summary of Regressions

st_drop_geometry(error_by_reg_and_fold) %>%
  group_by(Regression) %>% 
  summarize(`Mean MAE` = round(mean(MAE), 2),
            `SD MAE` = round(sd(MAE), 2)) %>%
  kable(caption = "Table 1: Summary of Regressions") %>%
  kable_styling("striped", full_width = F) %>% 
  row_spec(c(2,4), bold = T, color = "white", background = viridis::cividis(1)) %>% 
  kable_classic(full_width = F, html_font = "Cambria")
Table 1: Summary of Regressions
Regression Mean MAE SD MAE
Random k-fold CV: Predictors 4.49 13.03
Random k-fold CV: Predictors with Spatial Process 2.76 4.15
Spatial LOGO-CV: Predictors 10.11 31.73
Spatial LOGO-CV: Predictors with Spatial Process 6.64 14.71

Racial Context

In order to assess if the model generalizes to different neighborhood contexts, I specifically tested whether the model generalizes to a racial context. Using 2021 US Census data, I defined a neighborhood to be “Majority White” if over 50% of the population was White, and “Majority non-White” otherwise.

[description of table 2]

Table 2. Racial Context

RaceContext <- tracts19 %>% 
  dplyr::select(GEOID,total_pop,RaceContext,geometry) %>% 
  .[nhoods,]

reg.summary %>% 
  st_centroid() %>%
  st_join(RaceContext %>% dplyr::select(RaceContext, geometry)) %>%
  na.omit() %>%
  st_drop_geometry() %>%
  group_by(Regression, RaceContext) %>%
  summarize(mean.Error = mean(Error, na.rm = T)) %>%
  spread(RaceContext, mean.Error) %>%
  kable(caption = "Table 2. Mean Error by Neighborhood Racial Context") %>%
  kable_styling("striped", full_width = F) %>% 
  kable_classic(html_font = "Cambria")
Table 2. Mean Error by Neighborhood Racial Context
Regression

Table 2. Income Context

need to work on this.

# IncomeContext <- tracts22 %>% 
#   dplyr::select(GEOID,total_pop,IncomeContext,geometry) %>% 
#   .[nhoods,]
# 
# reg.summary %>% 
#   st_centroid() %>%
#   st_join(tracts22 %>% dplyr::select(IncomeContext, geometry)) %>%
#   na.omit() %>%
#   st_drop_geometry() %>%
#   group_by(Regression, IncomeContext) %>%
#   summarize(mean.Error = mean(Error, na.rm = T)) %>%
#   spread(IncomeContext, mean.Error) %>%
#   kable(caption = "Table 2. Mean Error by Neighborhood Income Context") %>%
#   kable_styling("striped", full_width = F) %>% 
#   kable_classic(html_font = "Cambria")

Conclusion

In conclusion….